ABSTRACT

The quality of web sources has been traditionally evaluated using 
exogenous signals such as the hyperlink structure of the graph. We 
propose a new approach that relies on endogenous signals, namely, 
the correctness of factual information provided by the source. A 
source that has few false facts is considered to be trustworthy. 

The facts are automatically extracted from each source by
information extraction methods commonly used to construct knowledge
bases. We propose a way to distinguish errors made in the extraction
process from factual errors in the web source per se, by using
joint inference in a novel multi-layer probabilistic model.

We call the trustworthiness score we compute Knowledge-Based
Trust (KBT). On synthetic data, we show that our method can
reliably compute the true trustworthiness levels of the sources. We
then apply it to a database of 2.8B facts extracted from the web,
and thereby estimate the trustworthiness of 119M webpages. Manual
evaluation of a subset of the results confirms the effectiveness
of the method.

1. INTRODUCTION 

“Learning to trust is one of life’s most difficult tasks.” 

- Isaac Watts. 

Quality assessment for web sources¹ is of tremendous importance
in web search. It has been traditionally evaluated using exogenous
signals such as hyperlinks and browsing history. However,
such signals mostly capture how popular a webpage is. For example,
gossip websites mostly have high PageRank scores, but would not
generally be considered reliable. Conversely, some less popular
websites nevertheless have very accurate information.

In this paper, we address the fundamental question of estimating
how trustworthy a given web source is. Informally, we define the
trustworthiness or accuracy of a web source as the probability that
it contains the correct value for a fact (such as Barack Obama's
nationality), assuming that it mentions any value for that fact. (Thus
we do not penalize sources that have few facts, so long as they are
correct.)

¹ We use the term “web source” to denote a specific webpage, such
as wiki.com/page1, or a whole website, such as wiki.com.
We discuss this distinction in more detail in Section 4.

We propose using Knowledge-Based Trust (KBT) to estimate source
trustworthiness as follows. We extract a plurality of facts from
many pages using information extraction techniques. We then jointly
estimate the correctness of these facts and the accuracy of the sources
using inference in a probabilistic model. Inference is an iterative
process, since we believe a source is accurate if its facts are correct,
and we believe the facts are correct if they are extracted from an
accurate source. We leverage the redundancy of information on the
web to break the symmetry. Furthermore, we show how to initialize
our estimate of the accuracy of sources based on authoritative
information, in order to ensure that this iterative process converges
to a good solution.

The fact extraction process we use is based on the Knowledge
Vault (KV) project [10]. KV uses 16 different information extraction
systems to extract (subject, predicate, object) knowledge
triples from webpages. An example of such a triple is (Barack
Obama, nationality, USA). A subject represents a real-world entity,
identified by an ID such as a Freebase mid; a predicate is
predefined in Freebase, describing a particular attribute of an entity;
an object can be an entity, a string, a numerical value, or a date.

The facts extracted by automatic methods such as KV may be
wrong. One method for estimating whether they are correct was
described in earlier work. However, that work did not distinguish
between factual errors on the page and errors made by the extraction
system. Extraction errors have been shown to be far more prevalent
than source errors. Ignoring this distinction can cause us to
incorrectly distrust a website.

Another problem with that approach is that it estimates
the reliability of each webpage independently. This can
cause problems when data are sparse. For example, for more than
one billion webpages, KV is only able to extract a single triple
(other extraction systems have similar limitations). This makes it
difficult to reliably estimate the trustworthiness of such sources.
On the other hand, for some pages KV extracts tens of thousands
of triples, which can create computational bottlenecks.

The KBT method introduced in this paper overcomes some of
these previous weaknesses. In particular, our contributions are
threefold. Our main contribution is a more sophisticated probabilistic
model, which can distinguish between two main sources of error:
incorrect facts on a page, and incorrect extractions made by an
extraction system. This provides a much more accurate estimate of
the source reliability. We propose an efficient, scalable algorithm
for performing inference and parameter estimation in the proposed
probabilistic model (Section 3).



Table 1: Summary of major notations used in the paper.

Notation          Description
$w \in W$         Web source
$e \in E$         Extractor
$d$               Data item
$v$               Value
$X_{ewdv}$        Binary indication of whether e extracts (d, v) from w
$X_{wdv}$         All extractions from w about (d, v)
$X_d$             All data about data item d
$X$               All input data
$C_{wdv}$         Binary indication of whether w provides (d, v)
$T_{dv}$          Binary indication of whether v is a correct value for d
$V_d$             True value for data item d under single-truth assumption
$A_w$             Accuracy of web source w
$P_e, R_e$        Precision and recall of extractor e


Figure 1: Form of the input data for (a) the single-layer model and
(b) the multi-layer model.

Our second contribution is a new method to adaptively decide
the granularity of sources to work with: if a specific webpage yields
too few triples, we may aggregate it with other webpages from the
same website. Conversely, if a website has too many triples, we
may split it into smaller ones, to avoid computational bottlenecks
(Section 4).

The third contribution of this paper is a detailed, large-scale
evaluation of the performance of our model. In particular, we applied
it to 2.8 billion triples extracted from the web, and were thus able
to reliably predict the trustworthiness of 119 million webpages and
5.6 million websites (Section 5).

We note that source trustworthiness provides an additional signal
for evaluating the quality of a website. We discuss new research
opportunities for improving it and using it in conjunction with
existing signals such as PageRank (Section 5.4.2). Also, we note that
although we present our methods in the context of knowledge
extraction, the general approach we propose can be applied to many
other tasks that involve data integration and data cleaning.

2. PROBLEM DEFINITION AND OVERVIEW 

In this section, we start with a formal definition of Knowledge-Based
Trust (KBT). We then briefly review our prior work that solves
a closely related problem, knowledge fusion [11]. Finally, we give
an overview of our approach, and summarize the difference from
our prior work.

2.1 Problem definition 

We are given a set of web sources W and a set of extractors E. An
extractor is a method for extracting (subject, predicate,
object) triples from a webpage. For example, one extractor
might look for the pattern “$A, the president of $B, ...”, from
which it can extract the triple (A, nationality, B). Of course, this
is not always correct (e.g., if A is the president of a company, not
a country). In addition, an extractor reconciles the string
representations of entities into entity identifiers such as Freebase
mids, and sometimes this fails too. It is the presence of these common
extractor errors, which are separate from source errors (i.e., incorrect
claims on a webpage), that motivates our work.

Table 2: Obama's nationality extracted by 5 extractors from 8
webpages. Column 2 (Value) shows the nationality truly provided by
each source; Columns 3-7 show the nationality extracted by each
extractor. Wrong extractions are marked with an asterisk; "-" means
that no value is provided or extracted.

      Value    E1       E2       E3       E4         E5
W1    USA      USA      USA      USA      USA        Kenya*
W2    USA      USA      USA      USA      N.Amer.*   -
W3    USA      USA      -        USA      N.Amer.*   -
W4    USA      USA      -        USA      Kenya*     -
W5    Kenya    Kenya    Kenya    Kenya    Kenya      Kenya
W6    Kenya    Kenya    -        Kenya    USA*       -
W7    -        -        -        Kenya*   -          Kenya*
W8    -        -        -        -        -          Kenya*

In the rest of the paper, we represent such triples as (data item,
value) pairs, where the data item is in the form of (subject,
predicate), describing a particular aspect of an entity, and the
object serves as a value for the data item. We summarize the
notation used in this paper in Table 1.

We define an observation variable $X_{ewdv}$. We set $X_{ewdv} = 1$
if extractor e extracted value v for data item d on web source w;
if it did not extract such a value, we set $X_{ewdv} = 0$. An extractor
might also return confidence values indicating how confident it is
in the correctness of the extraction; we consider these extensions in
Section 3.5. We use matrix $X = \{X_{ewdv}\}$ to denote all the data.

We can represent X as a (sparse) “data cube”, as shown in
Figure 1(b). Table 2 shows an example of a single horizontal “slice” of
this cube for the case where the data item is d* = (Barack Obama,
nationality). We discuss this example in more detail next.

Example 2.1. Suppose we have 8 webpages, W1–W8, and
suppose we are interested in the data item (Obama, nationality).
The value stated for this data item by each of the webpages is shown
in the “Value” column of Table 2. We see that W1–W4 provide
USA as the nationality of Obama, whereas W5–W6 provide
Kenya (a false value). Pages W7–W8 do not provide any
information regarding Obama's nationality.

Now suppose we have 5 different extractors of varying reliability.
The values they extract for this data item from each of the 8
webpages are shown in the table. Extractor E1 extracts all the provided
triples correctly. Extractor E2 misses some of the provided triples
(false negatives), but all of its extractions are correct. Extractor
E3 extracts all the provided triples, but also wrongly extracts the
value Kenya from W7, even though W7 does not provide this value
(a false positive). Extractors E4 and E5 both have poor quality,
missing a lot of provided triples and making numerous mistakes. □

For each web source $w \in W$, we define its accuracy, denoted by
$A_w$, as the probability that a value it provides for a fact is correct
(i.e., consistent with the real world). We use $A = \{A_w\}$ for the
set of all accuracy parameters. Finally, we can formally define the
problem of KBT estimation.

Definition 2.2 (KBT Estimation). The Knowledge-Based
Trust (KBT) estimation task is to estimate the web source
accuracies $A = \{A_w\}$ given the observation matrix $X = \{X_{ewdv}\}$ of
extracted triples. □

2.2 Estimating the truth using a single-layer model

KBT estimation is closely related to the knowledge fusion problem
we studied in our previous work [11], where we evaluate the
true (but latent) values for each of the data items, given the noisy
observations. We introduce the binary latent variables $T_{dv}$, which
represent whether v is a correct value for data item d. Let
$T = \{T_{dv}\}$. Given the observation matrix $X = \{X_{ewdv}\}$, the
knowledge fusion problem computes the posterior over the latent
variables, $p(T \mid X)$.

One way to solve this problem is to “reshape” the cube into a
two-dimensional matrix, as shown in Figure 1(a), by treating every
combination of web page and extractor as a distinct data source.
Now the data are in a form that standard data fusion techniques
(surveyed in prior work) expect. We call this a single-layer model, since
it only has one layer of latent variables (representing the unknown
values for the data items). We now review this model in detail, and
we compare it with our work shortly.

In our previous work [11], we applied the probabilistic model
described in [8]. We assume that each data item can only have a
single true value. This assumption holds for functional predicates,
such as nationality or date-of-birth, but is not technically valid for
set-valued predicates, such as child. Nevertheless, prior work showed
empirically that this “single truth” assumption works well in practice
even for non-functional predicates, so we shall adopt it in this work
for simplicity. (Approaches to dealing with multi-valued attributes
have been studied elsewhere.)

Based on the single-truth assumption, we define a latent variable
$V_d \in \mathrm{dom}(d)$ for each data item to represent the true value for d,
where $\mathrm{dom}(d)$ is the domain (set of possible values) for data item
d. Let $V = \{V_d\}$ and note that we can derive $T = \{T_{dv}\}$ from
V under the single-truth assumption. We then define the following
observation model:

$$p(X_{sdv} = 1 \mid V_d = v^*, A_s) = \begin{cases} A_s & \text{if } v = v^* \\ \frac{1 - A_s}{n} & \text{if } v \neq v^* \end{cases} \tag{1}$$

where $v^*$ is the true value, $s = (w, e)$ is the source, $A_s \in [0, 1]$
is the accuracy of this data source, and n is the number of false
values for this domain (i.e., we assume $|\mathrm{dom}(d)| = n + 1$). The
model says that the probability for s to provide a true value $v^*$ for
d is its accuracy, whereas the probability for it to provide one of the
n false values is $1 - A_s$ divided by n.

Given this model, it is simple to apply Bayes rule to compute
$p(V_d \mid X_d, A)$, where $X_d = \{X_{sdv}\}$ is all the data pertaining to
data item d (i.e., the d'th row of the data matrix), and $A = \{A_s\}$
is the set of all accuracy parameters. Assuming a uniform prior for
$p(V_d)$, this can be done as follows:

$$p(V_d = v \mid X_d, A) = \frac{p(X_d \mid V_d = v, A)}{\sum_{v' \in \mathrm{dom}(d)} p(X_d \mid V_d = v', A)} \tag{2}$$

where the likelihood function can be derived from Equation (1),
assuming independence of the data sources:²

$$p(X_d \mid V_d = v^*, A) = \prod_{s, v:\, X_{sdv} = 1} p(X_{sdv} = 1 \mid V_d = v^*, A_s) \tag{3}$$


This model is called the ACCU model. A slightly more advanced
model, known as POPACCU, removes the assumption that
the wrong values are uniformly distributed. Instead, it uses the
empirical distribution of values in the observed data. It was proved that
the POPACCU model is monotonic; that is, adding more sources
would not reduce the quality of results.

In both ACCU and POPACCU, it is necessary to jointly estimate
the hidden values $V = \{V_d\}$ and the accuracy parameters
$A = \{A_s\}$. An iterative EM-like algorithm was proposed for performing
this as follows:

² Previous works discussed how to detect copying and correlations
between sources in data fusion; however, scaling them up to billions of web
sources remains an open problem.

• Set the iteration counter t = 0.

• Initialize the parameters $A_s^t$ to some value (e.g., 0.8).

• Estimate $p(V_d \mid X_d, A^t)$ in parallel for all d using Equation (2)
(this is like the E step). From this we can compute the most
probable value, $\hat{V}_d = \arg\max p(V_d \mid X_d, A^t)$.

• Estimate $\hat{A}_s^{t+1}$ as follows:

$$\hat{A}_s^{t+1} = \frac{\sum_d \sum_v I(X_{sdv} = 1)\, p(V_d = v \mid X_d, A^t)}{\sum_d \sum_v I(X_{sdv} = 1)} \tag{4}$$

where $I(a = b)$ is 1 if a = b and is 0 otherwise. Intuitively
this equation says that we estimate the accuracy of a source
by the average probability of the facts it extracts. This
equation is like the M step in EM.

• We now return to the E step, and iterate until convergence.
Theoretical properties of this algorithm are discussed in prior work.
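The iteration above can be sketched in a few lines of Python. This is only an illustrative sketch of the ACCU loop under Equations (1)-(4), not the implementation used in the paper: the function names `value_posterior` and `accu`, the `claims` input layout, the clamping constant, and the convergence tolerance are all assumptions made for the example. It also assumes every source claims at least one value.

```python
import math
from collections import defaultdict

def value_posterior(votes, n_false):
    """Normalize vote counts; unobserved domain values carry a vote of 0."""
    unseen = max(n_false + 1 - len(votes), 0)
    top = max(max(votes.values()), 0.0)  # subtract the max for stability
    z = sum(math.exp(s - top) for s in votes.values()) + unseen * math.exp(-top)
    return {v: math.exp(s - top) / z for v, s in votes.items()}

def accu(claims, n_false=10, init_acc=0.8, max_iter=50, tol=1e-6):
    """claims[s][d] = value that source s provides for data item d."""
    acc = {s: init_acc for s in claims}
    post = {}
    for _ in range(max_iter):
        # E-like step: each claim adds log(n * A_s / (1 - A_s)) to its value.
        votes = defaultdict(lambda: defaultdict(float))
        for s, items in claims.items():
            a = min(max(acc[s], 1e-6), 1 - 1e-6)  # keep the log finite
            for d, v in items.items():
                votes[d][v] += math.log(n_false * a / (1 - a))
        post = {d: value_posterior(vs, n_false) for d, vs in votes.items()}
        # M-like step (Eq. 4): mean posterior of the values a source claims.
        new_acc = {s: sum(post[d][v] for d, v in items.items()) / len(items)
                   for s, items in claims.items()}
        delta = max(abs(new_acc[s] - acc[s]) for s in claims)
        acc = new_acc
        if delta < tol:
            break
    return acc, post

claims = {
    "s1": {"d1": "USA", "d2": "Honolulu"},
    "s2": {"d1": "USA", "d2": "Honolulu"},
    "s3": {"d1": "Kenya", "d2": "Mombasa"},
}
acc, post = accu(claims)
```

On this toy input the two agreeing sources end up with high accuracy and the dissenting source with low accuracy, which is the redundancy effect the text describes.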

2.3 Estimating KBT using a multi-layer model 

Although estimating KBT is closely related to knowledge
fusion, the single-layer model falls short in two aspects to solve the
new problem. The first issue is its inability to assess
trustworthiness of web sources independently of extractors; in other
words, $A_s$ is the accuracy of a (w, e) pair, rather than the accuracy
of a web source itself. Simply assuming all extracted values are
actually provided by the source obviously would not work. In our
example, we may wrongly infer that W1 is a bad source because of the
extracted Kenya value, although this is an extraction error.

The second issue is the inability to properly assess truthfulness
of triples. In our example, there are 12 sources (i.e.,
extractor-webpage pairs) for USA and 12 sources for Kenya; this seems to
suggest that USA and Kenya are equally likely to be true. However,
intuitively this seems unreasonable: extractors E1–E3 all tend to
agree with each other, and so seem to be reliable; we can therefore
“explain away” the Kenya values extracted by E4–E5 as being
more likely to be extraction errors.

Solving these two problems requires us to distinguish extraction
errors from source errors. In our example, we wish to distinguish
correctly extracted true triples (e.g., USA from W1–W4),
correctly extracted false triples (e.g., Kenya from W5–W6), wrongly
extracted true triples (e.g., USA from W6), and wrongly extracted
false triples (e.g., Kenya from W1, W4, W7–W8).

In this paper, we present a new probabilistic model that can
estimate the accuracy of each web source, factoring out the noise
introduced by the extractors. It differs from the single-layer model
in two ways. First, in addition to the latent variables to represent
the true value of each data item ($V_d$), the new model introduces a
set of latent variables to represent whether each extraction was
correct or not; this allows us to distinguish extraction errors and source
data errors. Second, instead of using A to represent the accuracy
of (e, w) pairs, the new model defines a set of parameters for the
accuracy of the web sources, and for the quality of the extractors;
this allows us to separate the quality of the sources from that of the
extractors. We call the new model the multi-layer model, because it
contains two layers of latent variables and parameters (Section 3).

The fundamental differences between the multi-layer model and
the single-layer model allow for reliable KBT estimation. In
Section 4, we also show how to dynamically select the granularity of
a source and an extractor. Finally, in Section 5, we show
empirically how both components play an important role in improving the
performance over the single-layer model.






3. MULTI-LAYER MODEL 

In this section, we describe in detail how we compute $A = \{A_w\}$
from our observation matrix $X = \{X_{ewdv}\}$ using a multi-layer
model.


3.1 The multi-layer model 

We extend the previous single-layer model in two ways. First,
we introduce the binary latent variables $C_{wdv}$, which represent
whether web source w actually provides triple (d, v) or not.
Similar to Equation (1), these variables depend on the true values $V_d$
and the accuracies of each of the web sources $A_w$ as follows:

$$p(C_{wdv} = 1 \mid V_d = v^*, A_w) = \begin{cases} A_w & \text{if } v = v^* \\ \frac{1 - A_w}{n} & \text{if } v \neq v^* \end{cases} \tag{5}$$

Second, following [27], we use a two-parameter noise model
for the observed data, as follows:

$$p(X_{ewdv} = 1 \mid C_{wdv} = c, Q_e, R_e) = \begin{cases} R_e & \text{if } c = 1 \\ Q_e & \text{if } c = 0 \end{cases} \tag{6}$$

Here $R_e$ is the recall of the extractor; that is, the probability of
extracting a truly provided triple. And $Q_e$ is 1 minus the specificity;
that is, the probability of extracting an unprovided triple. Parameter
$Q_e$ is related to the recall ($R_e$) and precision ($P_e$) as follows:

$$Q_e = \frac{\gamma}{1 - \gamma} \cdot \frac{1 - P_e}{P_e} \cdot R_e \tag{7}$$

where $\gamma = p(C_{wdv} = 1)$ for any $v \in \mathrm{dom}(d)$, as explained in
[27]. (Table 3 gives a numerical example of computing $Q_e$ from
$P_e$ and $R_e$.)
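As a quick check of Equation (7), the following sketch recomputes $Q_e$ for three of the extractors in Table 3 with $\gamma = 0.25$. The function name `q_from_precision_recall` is hypothetical and introduced only for this illustration.

```python
def q_from_precision_recall(p, r, gamma):
    """Equation (7): Q_e = gamma / (1 - gamma) * (1 - P_e) / P_e * R_e."""
    return gamma / (1.0 - gamma) * (1.0 - p) / p * r

# (name, precision P_e, recall R_e) taken from Table 3.
for name, p, r in [("E3", 0.85, 0.99), ("E4", 0.33, 0.33), ("E5", 0.25, 0.17)]:
    print(name, round(q_from_precision_recall(p, r, 0.25), 2))
# E3 0.06
# E4 0.22
# E5 0.17
```

These agree with the Q(Ei) row of Table 3 for E3, E4 and E5.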

To complete the specification of the model, we must specify the
prior probability of the various model parameters:

$$\theta_1 = \{A_w\}_{w=1}^{W}, \quad \theta_2 = (\{P_e\}_{e=1}^{E}, \{R_e\}_{e=1}^{E}), \quad \theta = (\theta_1, \theta_2) \tag{8}$$

For simplicity, we use uniform priors on the parameters. By
default, we set $A_w = 0.8$, $R_e = 0.8$, and $Q_e = 0.2$. In Section 5 we
discuss an alternative way to estimate the initial value of $A_w$, based
on the fraction of correct triples that have been extracted from this
source, using an external estimate of correctness (based on
Freebase).

Let $V = \{V_d\}$, $C = \{C_{wdv}\}$, and $Z = (V, C)$ be all the latent
variables. Our model defines the following joint distribution:

$$p(X, Z, \theta) = p(\theta)\, p(V)\, p(C \mid V, \theta_1)\, p(X \mid C, \theta_2) \tag{9}$$

We can represent the conditional independence assumptions we are
making using a graphical model, as shown in Figure 2. The shaded
node is an observed variable, representing the data; the unshaded
nodes are hidden variables or parameters. The arrows indicate the
dependence between the variables and parameters. The boxes are
known as “plates” and represent repetition of the enclosed
variables; for example, the box of e repeats for every extractor $e \in E$.

3.2 Inference 

Recall that estimating KBT essentially requires us to compute
the posterior over the parameters of interest, $p(A \mid X)$. Doing this
exactly is computationally intractable, because of the presence of
the latent variables Z. One approach is to use a Monte Carlo
approximation, such as Gibbs sampling, as in prior work. However, this
can be slow and is hard to implement in a Map-Reduce framework,
which is required for the scale of data we use in this paper.

A faster alternative is to use EM, which will return a point
estimate of all the parameters, $\hat{\theta} = \arg\max p(\theta \mid X)$. Since we are
using a uniform prior, this is equivalent to the maximum likelihood
estimate $\hat{\theta} = \arg\max p(X \mid \theta)$. From this, we can derive $\hat{A}$.



Figure 2: A representation of the multi-layer model using graphical 
model plate notation. 


Algorithm 1: MULTILAYER(X, t_max)

Input:  X: all extracted data;
        t_max: max number of iterations.
Output: Estimates of Z and θ.

1 Initialize θ to default values;
2 for t ∈ [1, t_max] do
3     Estimate C by Eqs. (15), (26), (31);
4     Estimate V by Eqs. (23)-(25);
5     Estimate θ1 by Eq. (28);
6     Estimate θ2 by Eqs. (32)-(33);
7     if Z, θ converge then
8         break;
9 return Z, θ;


As pointed out in [26], an exact EM algorithm has a quadratic
complexity even for a single-layer model, so is unaffordable for
data of web scale. Instead, we use an iterative “EM like” estimation
procedure, where we initialize the parameters as described
previously, and then alternate between estimating Z and then estimating
θ, until we converge.

We first give an overview of this EM-like algorithm, and then
go into details in the following sections.

In our case, Z consists of two “layers” of variables. We update
them sequentially, as follows. First, let $X_{wdv} = \{X_{ewdv}\}$ denote
all extractions from web source w about a particular triple
t = (d, v). We compute the extraction correctness
$p(C_{wdv} \mid X_{wdv}, \theta_2^t)$ as explained in Section 3.3.1, and then we
compute $\hat{C}_{wdv} = \arg\max p(C_{wdv} \mid X_{wdv}, \theta_2^t)$, which is our best
guess about the “true contents” of each web source. This can be done
in parallel over d, w, v.

Let $\hat{C}_d = \{\hat{C}_{wdv}\}$ denote all the estimated values for d across
the different websites. We compute $p(V_d \mid \hat{C}_d, \theta_1^t)$, as explained in
Section 3.3.2, and then we compute $\hat{V}_d = \arg\max p(V_d \mid \hat{C}_d, \theta_1^t)$,
which is our best guess about the “true value” of each data item.
This can be done in parallel over d.

Having estimated the latent variables, we then estimate $\theta^{t+1}$.
This parameter update also consists of two steps (but can be done in
parallel): estimating the source accuracies $\{A_w\}$ and the extractor
reliabilities $\{P_e, R_e\}$, as explained in Section 3.4.

Algorithm 1 gives a summary of the pseudo code; we give the
details next.

3.3 Estimating the latent variables 

We now give the details of how we estimate the latent variables
Z. For notational brevity, we drop the conditioning on $\theta^t$, except
where needed.

Table 3: Quality and vote counts of extractors in the motivating
example. We assume γ = .25 when we derive Qe from Pe and Re.

            E1      E2      E3      E4      E5
Q(Ei)       .01     .01     .06     .22     .17
R(Ei)       .99     .5      .99     .33     .17
P(Ei)       .99     .99     .85     .33     .25
Pre(Ei)     4.6     3.9     2.8     .4      0
Abs(Ei)     -4.6    -.7     -4.5    -.15    0


3.3.1 Estimating extraction correctness 

We first describe how to compute $p(C_{wdv} = 1 \mid X_{wdv})$, following
the “multi-truth” model of prior work. We will denote the prior
probability $p(C_{wdv} = 1)$ by α. In initial iterations, we initialize this to
α = 0.5. Note that by using a fixed prior, we break the connection
between $C_{wdv}$ and $V_d$ in the graphical model, as shown in Figure 2.
Thus, in subsequent iterations, we re-estimate $p(C_{wdv} = 1)$ using
the results of $V_d$ obtained from the previous iteration, as explained
in Section 3.3.4.

We use Bayes rule as follows:

$$\begin{aligned}
p(C_{wdv} = 1 \mid X_{wdv})
&= \frac{\alpha\, p(X_{wdv} \mid C_{wdv} = 1)}{\alpha\, p(X_{wdv} \mid C_{wdv} = 1) + (1 - \alpha)\, p(X_{wdv} \mid C_{wdv} = 0)} \\
&= \frac{1}{1 + \frac{p(X_{wdv} \mid C_{wdv} = 0)}{p(X_{wdv} \mid C_{wdv} = 1)} \cdot \frac{1 - \alpha}{\alpha}} \\
&= \sigma\!\left( \log \frac{p(X_{wdv} \mid C_{wdv} = 1)}{p(X_{wdv} \mid C_{wdv} = 0)} + \log \frac{\alpha}{1 - \alpha} \right)
\end{aligned} \tag{10}$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.

Assuming independence of the extractors, and using Equation (6),
we can compute the likelihood ratio as follows:

$$\frac{p(X_{wdv} \mid C_{wdv} = 1)}{p(X_{wdv} \mid C_{wdv} = 0)} = \prod_{e:\, X_{ewdv} = 1} \frac{R_e}{Q_e} \prod_{e:\, X_{ewdv} = 0} \frac{1 - R_e}{1 - Q_e} \tag{11}$$

In other words, for each extractor we can compute a presence
vote $\mathrm{Pre}_e$ for a triple that it extracts, and an absence vote $\mathrm{Abs}_e$
for a triple that it does not extract:

$$\mathrm{Pre}_e = \log R_e - \log Q_e \tag{12}$$

$$\mathrm{Abs}_e = \log(1 - R_e) - \log(1 - Q_e) \tag{13}$$

For each triple (w, d, v) we can compute its vote count as the
sum of the presence votes and the absence votes:

$$VCC(w, d, v) = \sum_{e:\, X_{ewdv} = 1} \mathrm{Pre}_e + \sum_{e:\, X_{ewdv} = 0} \mathrm{Abs}_e \tag{14}$$

Accordingly, we can rewrite Equation (10) as follows:

$$p(C_{wdv} = 1 \mid X_{wdv}) = \sigma\!\left( VCC(w, d, v) + \log \frac{\alpha}{1 - \alpha} \right) \tag{15}$$

Example 3.1. Consider the extractors in the motivating
example (Table 2). Suppose we know $Q_e$ and $R_e$ for each extractor e as
shown in Table 3. We can then compute $\mathrm{Pre}_e$ and $\mathrm{Abs}_e$ as shown
in the same table. We observe that in general, an extractor with
low $Q_e$ (unlikely to extract an unprovided triple; e.g., E1, E2)
often has a high presence vote; an extractor with high $R_e$ (likely to
extract a provided triple; e.g., E1, E3) often has a low (negative)
absence vote; and a low-quality extractor (e.g., E5) often has a low
presence vote and a high absence vote.

Now consider applying Equation (15) to compute the likelihood
that a particular source provides the triple t* = (Obama,
nationality, USA), assuming α = 0.5. For source W1, we see that
extractors E1–E4 extract t*, so the vote count is (4.6 + 3.9 + 2.8 +
0.4) + (0) = 11.7 and hence $p(C_{1,t^*} = 1 \mid X_{1,t^*}) = \sigma(11.7) = 1$.
For source W6, we see that only E4 extracts t*, so the vote
count is (0.4) + (−4.6 − 0.7 − 4.5 − 0) = −9.4, and hence
$p(C_{6,t^*} = 1 \mid X_{6,t^*}) = \sigma(-9.4) = 0$. Some other values for
$p(C_{wt} = 1 \mid X_{wt})$ are shown in Table 4. □

Table 4: Extraction correctness and data item value distribution
for the data in Table 2, using the extraction parameters in Table 3.
Columns 2-4 show $p(C_{wdv} = 1 \mid X_{wdv})$, as explained in
Example 3.1. The last row shows $p(V_d \mid \hat{C}_d)$, as explained in Example 3.2;
note that this distribution does not sum to 1.0, since not all of the
values are shown in the table.

                USA     Kenya   N.Amer.
W1              1       0       -
W2              1       -       0
W3              1       -       0
W4              1       0       -
W5              -       1       -
W6              0       1       -
W7              -       .07     -
W8              -       0       -
p(Vd | Cd)      .995    .004    0

Having computed $p(C_{wdv} = 1 \mid X_{wdv})$, we can compute
$\hat{C}_{wdv} = \arg\max p(C_{wdv} \mid X_{wdv})$. This serves as the input to the
next step of inference.

3.3.2 Estimating true value of the data item

In this step, we compute $p(V_d = v \mid \hat{C}_d)$, following the “single
truth” model reviewed in Section 2.2. By Bayes rule we have

$$p(V_d = v \mid \hat{C}_d) = \frac{p(\hat{C}_d \mid V_d = v)\, p(V_d = v)}{\sum_{v' \in \mathrm{dom}(d)} p(\hat{C}_d \mid V_d = v')\, p(V_d = v')} \tag{16}$$

Since we do not assume any prior knowledge of the correct values,
we assume a uniform prior $p(V_d = v)$, so we just need to focus on
the likelihood. Using Equation (5), we have

$$p(\hat{C}_d \mid V_d = v) = \prod_{w:\, \hat{C}_{wdv} = 1} A_w \prod_{w:\, \hat{C}_{wdv} = 0} \frac{1 - A_w}{n} \tag{17}$$

$$= \prod_{w:\, \hat{C}_{wdv} = 1} \frac{n A_w}{1 - A_w} \prod_{w:\, \hat{C}_{wdv} \in \{0, 1\}} \frac{1 - A_w}{n} \tag{18}$$

Since the latter term $\prod_{w:\, \hat{C}_{wdv} \in \{0,1\}} \frac{1 - A_w}{n}$ is constant with
respect to v, we can drop it.

Now let us define the vote count as follows:

$$VCV(w) = \log \frac{n A_w}{1 - A_w} \tag{19}$$

Aggregating over web sources that provide this triple, we define

$$VCV(d, v) = \sum_{w:\, \hat{C}_{wdv} = 1} VCV(w) \tag{20}$$

With this notation, we can rewrite Equation (16) as

$$p(V_d = v \mid \hat{C}_d) = \frac{\exp(VCV(d, v))}{\sum_{v' \in \mathrm{dom}(d)} \exp(VCV(d, v'))} \tag{21}$$






































Example 3.2. Assume we have correctly decided the triple
provided by each web source, as in the “Value” column of Table 2.
Assume each source has the same accuracy $A_w = 0.6$ and n = 10,
so the vote count is $\ln\!\left(\frac{10 \cdot 0.6}{1 - 0.6}\right) = 2.7$. Then USA has vote count
2.7 * 4 = 10.8, Kenya has vote count 2.7 * 2 = 5.4, and an
unprovided value, such as N.Amer., has vote count 0. Since there are
10 false values in the domain, there are 9 unprovided values.
Hence we have $p(V_d = \mathrm{USA} \mid \hat{C}_d) = \frac{\exp(10.8)}{Z} = 0.995$, where
Z = exp(10.8) + exp(5.4) + exp(0) * 9. Similarly,
$p(V_d = \mathrm{Kenya} \mid \hat{C}_d) = \frac{\exp(5.4)}{Z} = 0.004$. This is shown in the last row
of Table 4. The missing mass of 1 − (0.995 + 0.004) is assigned
(uniformly) to the other 9 values that were not observed (but in the
domain). □
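The arithmetic of Example 3.2 can be reproduced with a short sketch of Equations (19)-(21). The helper names `vcv` and `value_distribution` and the `providers` input layout are assumptions made for this illustration only.

```python
import math

def vcv(acc, n_false):
    """Equation (19): vote of one source with accuracy acc."""
    return math.log(n_false * acc / (1.0 - acc))

def value_distribution(providers, n_false):
    """providers: {value: [accuracy of each source providing it]}."""
    votes = {v: sum(vcv(a, n_false) for a in accs)       # Eq. (20)
             for v, accs in providers.items()}
    unseen = n_false + 1 - len(votes)   # domain values that nobody provides
    z = sum(math.exp(s) for s in votes.values()) + unseen  # exp(0) = 1 each
    return {v: math.exp(s) / z for v, s in votes.items()}  # Eq. (21)

# Four sources provide USA, two provide Kenya, all with accuracy 0.6.
dist = value_distribution({"USA": [0.6] * 4, "Kenya": [0.6] * 2}, n_false=10)
print({v: round(p, 3) for v, p in dist.items()})
# {'USA': 0.995, 'Kenya': 0.004}
```

Each of the 9 unprovided values receives probability 1/Z, which accounts for the missing mass mentioned in the example.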

3.3.3 An improved estimation procedure 
So far, we have assumed that we first compute a MAP estimate
$\hat{C}_{wdv}$, which we then use as evidence for estimating $V_d$. However,
this ignores the uncertainty in $\hat{C}$. The correct thing to do is to
compute $p(V_d \mid X_d)$, marginalizing out over $C_{wdv}$:

$$p(V_d \mid X_d) \propto p(V_d)\, p(X_d \mid V_d) = p(V_d) \sum_{\vec{c}} p(C_d = \vec{c} \mid V_d)\, p(X_d \mid C_d = \vec{c}) \tag{22}$$

Here we can consider each $\vec{c}$ as a possible world, where each
element $c_{wdv}$ indicates whether a source w provides a triple (d, v)
(value 1) or not (value 0).

As a simple heuristic approximation to this approach, we replace
the previous vote counting with a weighted version, as follows:

$$VCV'(w, d, v) = p(C_{wdv} = 1 \mid X_d) \cdot \log \frac{n A_w}{1 - A_w} \tag{23}$$

$$VCV'(d, v) = \sum_w VCV'(w, d, v) \tag{24}$$

We then compute

$$p(V_d = v \mid X_d) = \frac{\exp(VCV'(d, v))}{\sum_{v' \in \mathrm{dom}(d)} \exp(VCV'(d, v'))} \tag{25}$$

We will show in experiments (Section 5.3.3) that this improved
estimation procedure outperforms ignoring the uncertainty in $C_d$.
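A minimal sketch of the weighted vote count in Equations (23)-(25) follows. The function name `weighted_value_distribution` and its input layout (a per-source dictionary of extraction-correctness probabilities for one data item) are assumptions made for illustration, not part of the paper.

```python
import math

def weighted_value_distribution(provides_prob, acc, n_false):
    """provides_prob[w][v] = p(C_wdv = 1 | X_d) for one data item d;
    acc[w] = accuracy A_w of web source w."""
    votes = {}
    for w, probs in provides_prob.items():
        vote = math.log(n_false * acc[w] / (1.0 - acc[w]))
        for v, p in probs.items():
            votes[v] = votes.get(v, 0.0) + p * vote      # Eqs. (23)-(24)
    unseen = n_false + 1 - len(votes)   # unobserved values keep a vote of 0
    z = sum(math.exp(s) for s in votes.values()) + unseen
    return {v: math.exp(s) / z for v, s in votes.items()}  # Eq. (25)

# Hypothetical input: W1 certainly provides USA; W7 provides Kenya
# only with probability 0.07 (compare Table 4).
dist = weighted_value_distribution(
    {"W1": {"USA": 1.0}, "W7": {"Kenya": 0.07}},
    acc={"W1": 0.6, "W7": 0.6},
    n_false=10,
)
```

A weakly supported extraction such as the 0.07 entry contributes only a small fraction of a vote instead of being rounded to 0 or 1.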


3.3.4 Re-estimating the prior of correctness 


In Section 3.3.1 we assumed that $p(C_{wdv} = 1) = \alpha$ was
known, which breaks the connection between $V_d$ and $C_{wdv}$. Thus,
we update this prior after each iteration according to the correctness
of the value and the accuracy of the source:

$$\hat{\alpha}^{t+1} = p(V_d = v \mid X)\, A_w + (1 - p(V_d = v \mid X))(1 - A_w) \tag{26}$$

We can then use this refined estimate in the following iteration. We
give an example of this process.

Example 3.3. Consider the probability that W7 provides t' =
(Obama, nationality, Kenya). Two extractors extract t' from W7
and the vote count is −2.65, so the initial estimate is
$p(C_{wdv} = 1 \mid X) = \sigma(-2.65) = 0.06$. However, after the previous iteration
has finished, we know that $p(V_d = \mathrm{Kenya} \mid X) = 0.004$. This gives
us a modified prior probability as follows: $p'(C_{wt} = 1) = 0.004 \cdot
0.6 + (1 - 0.004) \cdot (1 - 0.6) = 0.4$, assuming $A_w = 0.6$. Hence
the updated posterior probability is given by
$p'(C_{wt} = 1 \mid X) = \sigma\!\left(-2.65 + \log \frac{0.4}{1 - 0.4}\right) = 0.04$, which is lower than before. □
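The extraction-correctness computation (Equations 12-15) and the prior update (Equation 26) can be sketched as follows, using the $R_e$ and $Q_e$ values of Table 3. The function names are hypothetical and chosen for this illustration. Because Table 3 reports rounded votes, the numbers printed here differ slightly from those quoted in Examples 3.1 and 3.3.

```python
import math

# Recall R_e and false-positive rate Q_e per extractor, copied from Table 3.
R = {"E1": 0.99, "E2": 0.5, "E3": 0.99, "E4": 0.33, "E5": 0.17}
Q = {"E1": 0.01, "E2": 0.01, "E3": 0.06, "E4": 0.22, "E5": 0.17}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def vote_count(extracted_by):
    """Equation (14): presence votes plus absence votes for one triple."""
    total = 0.0
    for e in R:
        if e in extracted_by:
            total += math.log(R[e]) - math.log(Q[e])          # Pre_e, Eq. (12)
        else:
            total += math.log(1 - R[e]) - math.log(1 - Q[e])  # Abs_e, Eq. (13)
    return total

def p_provided(extracted_by, alpha=0.5):
    """Equation (15): probability that the source provides the triple."""
    return sigmoid(vote_count(extracted_by) + math.log(alpha / (1 - alpha)))

def updated_prior(p_true, acc):
    """Equation (26): re-estimated prior p(C_wdv = 1)."""
    return p_true * acc + (1 - p_true) * (1 - acc)

w7_kenya = {"E3", "E5"}   # extractors that return Kenya from W7 (Table 2)
alpha = updated_prior(p_true=0.004, acc=0.6)
print(round(vote_count(w7_kenya), 2),
      round(p_provided(w7_kenya), 3),
      round(p_provided(w7_kenya, alpha), 3))
```

The same `vote_count` helper gives about 11.7 for (W1, USA) and about −9.4 for (W6, USA), matching Example 3.1.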

3.4 Estimating the quality parameters 

Having estimated the latent variables, we now estimate the
parameters of the model.


3.4.1 Source quality 

Following the single-layer model, we estimate the accuracy of a source
by computing the average probability of its provided values being true:

$$\hat{A}_w^{t+1} = \frac{\sum_{dv:\, \hat{C}_{wdv} = 1} p(V_d = v \mid X)}{\sum_{dv:\, \hat{C}_{wdv} = 1} 1} \tag{27}$$

We can take uncertainty of $\hat{C}$ into account as follows:

$$\hat{A}_w^{t+1} = \frac{\sum_{dv} p(C_{wdv} = 1 \mid X)\, p(V_d = v \mid X)}{\sum_{dv} p(C_{wdv} = 1 \mid X)} \tag{28}$$

This is the key equation behind Knowledge-Based Trust estimation:
it estimates the accuracy of a web source as the weighted average
of the probability of the facts that it contains (provides), where the
weights are the probability that these facts are indeed contained in
that source.
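The weighted average described above can be sketched as follows. The function name `source_accuracy`, the input layout, and the example numbers are assumptions made for illustration; the sums run over every candidate (d, v) pair supplied for the source.

```python
def source_accuracy(triples):
    """triples: list of (p_provided, p_true) pairs for one web source w,
    i.e. p(C_wdv = 1 | X) and p(V_d = v | X) for each candidate (d, v).
    Returns the weighted average of p_true with weights p_provided."""
    weight = sum(pc for pc, _ in triples)
    if weight == 0:
        return None  # no evidence that the source provides anything
    return sum(pc * pv for pc, pv in triples) / weight

# Hypothetical source: one triple it almost surely provides and that is
# very likely true, plus one doubtful extraction of an unlikely value.
kbt = source_accuracy([(1.0, 0.995), (0.07, 0.004)])
```

The doubtful extraction lowers the score only slightly, because its weight is small; with a hard 0/1 decision it would either be ignored or count as a full error.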


3.4.2 Extractor quality 

According to the definition of precision and recall, we can
estimate them as follows:

$$\hat{P}_e^{t+1} = \frac{\sum_{wdv:\, X_{ewdv} = 1} p(C_{wdv} = 1 \mid X)}{\sum_{wdv:\, X_{ewdv} = 1} 1} \qquad (29)$$

$$\hat{R}_e^{t+1} = \frac{\sum_{wdv:\, X_{ewdv} = 1} p(C_{wdv} = 1 \mid X)}{\sum_{wdv} p(C_{wdv} = 1 \mid X)} \qquad (30)$$

Note that, for reasons explained in [?], it is much more reliable to
estimate $P_e$ and $R_e$ from data and then compute $Q_e$ using the
equation relating it to $P_e$ and $R_e$ given earlier, rather than
trying to estimate $Q_e$ directly.
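
The two estimates can be sketched as follows for a single extractor with binary extractions. The function name and the dictionary layout are hypothetical; the keys stand for (w, d, v) combinations.

```python
def extractor_precision_recall(extracted, p_correct):
    """Estimate precision and recall of one extractor.

    extracted: set of keys the extractor extracted (X_ewdv = 1).
    p_correct: dict mapping every candidate key to p(C_wdv = 1 | X).
    """
    hit = sum(p_correct[key] for key in extracted)
    precision = hit / len(extracted)         # Equation (29)
    recall = hit / sum(p_correct.values())   # Equation (30)
    return precision, recall


p_correct = {"t1": 0.9, "t2": 0.1, "t3": 1.0}
precision, recall = extractor_precision_recall({"t1", "t2"}, p_correct)
```

In this toy input the extractor finds one "expected" correct triple out of two extractions, and one out of two expected provided triples overall.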


3.5 Handling confidence-weighted extractions 

So far, we have assumed that each extractor returns a binary
decision about whether it extracts a triple or not, $X_{ewdv} \in \{0, 1\}$.
However, in real life, extractors return confidence scores, which
we can interpret as the probability that the triple is present on the
page according to that extractor. Let us denote this “soft evidence”
by $p(X_{ewdv} = 1) = \bar{X}_{ewdv} \in [0, 1]$. A simple way to handle
such data is to binarize it by thresholding. However, this loses
information, as shown in the following example.


Example 3.4. Consider the case that $E_1$ and $E_3$ are not fully
confident with their extractions from $W_3$ and $W_4$. In particular,
$E_1$ gives each extraction a probability (i.e., confidence) of 0.85, and
$E_3$ gives probability 0.5. Although no extractor has full confidence
for the extraction, after observing their extractions collectively, we
would be fairly confident that $W_3$ and $W_4$ indeed provide triple
$T =$ (Obama, nationality, USA).

However, if we simply apply a threshold of 0.7, we would ignore
the extractions from $W_3$ and $W_4$ by $E_3$. Because of lack of
extraction, we would conclude that neither $W_3$ nor $W_4$ provides $T$. Then,
since USA is provided by $W_1$ and $W_2$, whereas Kenya is provided
by $W_5$ and $W_6$, and the sources all have the same accuracy, we
would compute an equal probability for USA and for Kenya. □

Following the same approach as in Equation (?), we propose to
modify Equation (?) as follows:

$$VCC'(w, d, v) = \sum_e \left[ p(X_{ewt} = 1)\, \mathrm{Pre}_e + p(X_{ewt} = 0)\, \mathrm{Abs}_e \right] \qquad (31)$$
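
The effect of Equation (31) compared with thresholding can be illustrated with a small sketch. It assumes that the presence and absence votes are log-likelihood ratios built from an extractor's recall $R_e$ and its $Q_e$ parameter, i.e. $\mathrm{Pre}_e = \log R_e - \log Q_e$ and $\mathrm{Abs}_e = \log(1 - R_e) - \log(1 - Q_e)$; the quality values and function names below are hypothetical.

```python
import math


def presence_vote(recall, q):
    """Assumed form of Pre_e: log-likelihood ratio for an extraction."""
    return math.log(recall) - math.log(q)


def absence_vote(recall, q):
    """Assumed form of Abs_e: log-likelihood ratio for no extraction."""
    return math.log(1.0 - recall) - math.log(1.0 - q)


def soft_vote_count(confidences, quality):
    """Equation (31): confidence-weighted vote count for one triple.

    confidences: dict extractor -> p(X_ewt = 1); missing means 0.
    quality: dict extractor -> (recall, q).
    """
    total = 0.0
    for extractor, (recall, q) in quality.items():
        p = confidences.get(extractor, 0.0)
        total += p * presence_vote(recall, q)
        total += (1.0 - p) * absence_vote(recall, q)
    return total


def thresholded_vote_count(confidences, quality, threshold):
    """Binarize the confidences first, then apply the same formula."""
    hard = {e: 1.0 if c >= threshold else 0.0
            for e, c in confidences.items()}
    return soft_vote_count(hard, quality)


# Hypothetical extractor quality; confidences as in Example 3.4.
quality = {"E1": (0.8, 0.2), "E3": (0.8, 0.2)}
confidences = {"E1": 0.85, "E3": 0.5}
soft = soft_vote_count(confidences, quality)
hard = thresholded_vote_count(confidences, quality, 0.7)
```

Under these assumed quality values the soft vote count stays positive, whereas thresholding at 0.7 turns the extraction by E3 into evidence of absence and cancels the vote of E1.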
















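Similarly, we modify the precision and recall estimates:

$$\hat{P}_e = \frac{\sum_{wdv} p(X_{ewdv} = 1)\, p(C_{wdv} = 1 \mid X)}{\sum_{wdv} p(X_{ewdv} = 1)}$$

$$\hat{R}_e = \frac{\sum_{wdv} p(X_{ewdv} = 1)\, p(C_{wdv} = 1 \mid X)}{\sum_{wdv} p(C_{wdv} = 1 \mid X)}$$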


4. DYNAMICALLY SELECTING GRANULARITY

This section describes the choice of the granularity for web sources; 
at the end of this section we discuss how to apply it to extractors. 
This step is conducted before applying the multi-layer model. 

Ideally, we wish to use the finest granularity. For example, it is 
natural to treat each webpage as a separate source, as it may have 
a different accuracy from other webpages. We may even define a 
source as a specific predicate on a specific webpage; this allows 
us to estimate how trustworthy a page is about a specific kind of 
predicate. However, when we define sources too finely, we may 
have too little data to reliably estimate their accuracies; conversely, 
there may exist sources that have too much data even at their finest 
granularity, which can cause computational bottlenecks. 

To handle this, we wish to dynamically choose the granularity of
the sources. For sources that are too small, we can “back off” to a
coarser level of the hierarchy; this allows us to “borrow statistical
strength” between related pages. For sources that are too large, we may
choose to split them into multiple sources and estimate their accuracies
independently. When we merge, our goal is to improve the statistical
quality of our estimates without sacrificing efficiency. When we
split, our goal is to significantly improve efficiency in the presence
of data skew, without changing our estimates dramatically.

To be more precise, we can define a source at multiple levels
of resolution by specifying the following values of a feature vector:
(website, predicate, webpage), ordered from most
general to most specific. We can then arrange these sources in
a hierarchy. For example, (wiki.com) is a parent of (wiki.com,
date_of_birth), which in turn is a parent of (wiki.com, date_of_birth,
wiki.com/page1.html). We define the following two operators.

• Split: When we split a large source, we wish to split it randomly
into sub-sources of similar sizes. Specifically, let $W$
be a source with size $|W|$, and $M$ be the maximum size
we desire; we uniformly distribute the triples from $W$ into
$\lceil |W| / M \rceil$ buckets, each representing a sub-source. We set $M$
to a large number that does not require splitting sources
unnecessarily and meanwhile would not cause a computational
bottleneck according to the system performance.

• Merge: When we merge small sources, we wish to merge 
only sources that share some common features, such as shar¬ 
ing the same predicate, or coming from the same website; 
hence we only merge children with the same parent in the 
hierarchy. We set m to a small number that does not require 
merging sources unnecessarily while maintaining enough sta¬ 
tistical strength. 

Example 4.1. Consider three sources: (website1.com,
date_of_birth), (website1.com, place_of_birth), (website1.com,
gender), each with two triples, arguably not enough for quality
evaluation. We can merge them into their parent source by removing
the second feature. We then obtain a source (website1.com)
with size 2*3 = 6, which gives more data for quality evaluation.

□ 


Algorithm 2: SplitAndMerge(𝐖, m, M)

Input : 𝐖: sources with finest granularity;
        m/M: min/max desired source size.
Output: 𝐖': a new set of sources with desired size.

 1 𝐖' ← ∅;
 2 for W ∈ 𝐖 do
 3     𝐖 ← 𝐖 \ {W};
 4     if |W| > M then
 5         𝐖' ← 𝐖' ∪ Split(W);
 6     else if |W| < m then
 7         W_par ← GetParent(W);
 8         if W_par = ⊥ then
 9             𝐖' ← 𝐖' ∪ {W};    // already at the top of the hierarchy
10         else
11             𝐖 ← 𝐖 ∪ {W_par};
12     else
13         𝐖' ← 𝐖' ∪ {W};
14 return 𝐖';


Note that when we merge small sources, the resulting parent source
may not be of the desired size: it may still be too small, or it may be
too large after we merge a huge number of small sources. As a result,
we might need to iteratively merge the resulting sources into their
parents, or split an oversized resulting source, as we describe
in the full algorithm.

Algorithm 2 gives the SplitAndMerge algorithm. We use 𝐖
for sources under examination and 𝐖' for final results; at the
beginning 𝐖 contains all sources of the finest granularity and 𝐖' = ∅
(Ln 1). We consider each W ∈ 𝐖 (Ln 2). If W is too large, we
apply Split to split it into a set of sub-sources; Split guarantees
that each sub-source would be of desired size, so we add the
sub-sources to 𝐖' (Ln 5). If W is too small, we obtain its parent source
(Ln 7). In case W is already at the top of the source hierarchy so
it has no parent, we add it to 𝐖' (Ln 9); otherwise, we add W_par
back to 𝐖 (Ln 11). Finally, for sources already of the desired size, we
move them directly to 𝐖' (Ln 13).
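
The procedure can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: it assumes each source is keyed by its feature tuple, that a parent is obtained by dropping the last feature, that a parent collects the triples of all small children that back off to it, and that sources are processed level by level from finest to coarsest. The names `split` and `split_and_merge` are hypothetical.

```python
import math
import random
from collections import defaultdict


def split(triples, max_size, rng):
    """Uniformly distribute triples into ceil(n / max_size) buckets."""
    n_buckets = math.ceil(len(triples) / max_size)
    shuffled = list(triples)
    rng.shuffle(shuffled)
    return [shuffled[i::n_buckets] for i in range(n_buckets)]


def split_and_merge(sources, m, M, seed=0):
    """sources: dict mapping a feature tuple, e.g.
    (website, predicate, webpage), to its list of triples.
    Returns a list of (key, triples) pairs of the chosen granularity.
    """
    rng = random.Random(seed)
    depth = max(len(key) for key in sources)
    # levels[k] holds the sources whose key has k features.
    levels = defaultdict(lambda: defaultdict(list))
    for key, triples in sources.items():
        levels[len(key)][key].extend(triples)

    result = []
    for level in range(depth, 0, -1):        # finest to coarsest
        for key, triples in levels[level].items():
            if len(triples) > M:             # too large: split
                result.extend((key, part)
                              for part in split(triples, M, rng))
            elif len(triples) < m and level > 1:
                # too small: back off to the parent source
                levels[level - 1][key[:-1]].extend(triples)
            else:                            # desired size, or no parent
                result.append((key, triples))
    return result


# 1000 single-triple sources of one website, sizes wanted in [5, 500].
sources = {("W", f"P{i}", f"URL{i}"): [("s", f"P{i}", f"o{i}")]
           for i in range(1000)}
result = split_and_merge(sources, m=5, M=500)
```

With this input every source backs off twice, to (W, P_i) and then to (W), and the single remaining source of 1000 triples is split into two sub-sources of 500 triples each.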

Example 4.2. Consider a set of 1000 sources $(W, P_i, URL_i)$,
$i \in [1, 1000]$; in other words, they belong to the same website, and
each has a different predicate and a different URL. Assuming we wish to
have sources with size in [5, 500], MultiLayerSM proceeds in
three stages.

In the first stage, each source is deemed too small and is replaced
with its parent source $(W, P_i)$. In the second stage, each
new source is still deemed too small and is replaced with its parent
source $(W)$. In the third stage, the single remaining source is
deemed too large and is split uniformly into two sub-sources. The
algorithm terminates with 2 sources, each of size 500. □

Finally, we point out that the same techniques apply to extractors
as well. We define an extractor using the following feature vector, 
again ordered from most general to most specific: (extractor, 
pattern, predicate, website). The finest granularity 
represents the quality of a particular extractor pattern (different pat¬ 
terns may have different quality), on extractions for a particular 
predicate (in some cases when a pattern can extract triples of dif¬ 
ferent predicates, it may have different quality), from a particular 
website (a pattern may have different quality on different websites). 










5. EXPERIMENTAL RESULTS 

This section describes our experimental results on a synthetic
data set (where we know the ground truth), and on large-scale
real-world data. We show that (1) our algorithm can effectively estimate
the correctness of extractions, the truthfulness of triples, and the
accuracy of sources; (2) our model significantly improves over the
state-of-the-art methods for knowledge fusion; and (3) KBT provides
a valuable additional signal for web source quality.

5.1 Experiment Setup 

5.1.1 Metrics 

We measure how well we predict extraction correctness, triple
probability, and source accuracy. For synthetic data, we have the
benefit of ground truth, so we can exactly measure all three aspects.
We quantify this in terms of square loss; the lower the square loss,
the better. Specifically, SqV measures the average square loss
between $p(V_d = v \mid X)$ and the true value of $I(V_d = v)$; SqC
measures the average square loss between $p(C_{wdv} = 1 \mid X)$ and the
true value of $I(C_{wdv} = 1)$; and SqA measures the average square loss
between $\hat{A}_w$ and the true value of $A_w$.

For real data, however, as we show shortly, we do not have a gold
standard for source trustworthiness, and we have only a partial gold
standard for triple correctness and extraction correctness. Hence
for real data, we focus on measuring how well we predict triple
truthfulness. In addition to SqV, we also used the following three
metrics for this purpose, which were also used in [11].

• Weighted deviation (WDev): WDev measures whether the 
predicted probabilities are calibrated. We divide our triples 
according to the predicted probabilities into buckets [0,0.01), 

..., [0.04,0.05), [0.05, 0.1),..., [0.9, 0.95), [0.95, 0.96),..., 
[0.99,1), [1,1] (most triples fall in [0,0.05) and [0.95,1], so 
we used a finer granularity there). For each bucket we com¬ 
pute the accuracy of the triples according to the gold stan¬ 
dard, which can be considered as the real probability of the 
triples. WDev computes the average square loss between the 
predicted probabilities and the real probabilities, weighted 
by the number of triples in each bucket; the lower the better. 

• Area under precision recall curve (AUC-PR) : AUC-PR mea¬ 
sures whether the predicted probabilities are monotonic. We 
order triples according to the computed probabilities and plot 
PR-curves, where the X-axis represents the recall and the Y- 
axis represents the precision. AUC-PR computes the area- 
under-the-curve; the higher the better. 

• Coverage (Cov): Cov computes for what percentage of the 
triples we compute a probability (as we show soon, we may 
ignore data from a source whose quality remains at the de¬ 
fault value over all the iterations). 
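
A minimal sketch of the square-loss and WDev computations is given below. It assumes that the "predicted probability" of a bucket is the mean prediction of the triples falling into it, which the text does not state explicitly; the function names are hypothetical.

```python
from bisect import bisect_right


def square_loss(pred, truth):
    """Average square loss between predictions and 0/1 labels."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)


def wdev_edges():
    """Left edges of the buckets [0, 0.01), ..., [0.99, 1), [1, 1]."""
    fine_low = [i / 100 for i in range(0, 5)]      # 0.00 .. 0.04
    coarse = [i / 100 for i in range(5, 95, 5)]    # 0.05 .. 0.90
    fine_high = [i / 100 for i in range(95, 101)]  # 0.95 .. 1.00
    return fine_low + coarse + fine_high


def wdev(pred, truth):
    """Weighted deviation between predicted and observed accuracy."""
    edges = wdev_edges()
    buckets = {}
    for p, t in zip(pred, truth):
        idx = bisect_right(edges, p) - 1   # p == 1.0 gets bucket [1, 1]
        sum_p, sum_t, n = buckets.get(idx, (0.0, 0, 0))
        buckets[idx] = (sum_p + p, sum_t + t, n + 1)
    weighted = sum(n * (sum_p / n - sum_t / n) ** 2
                   for sum_p, sum_t, n in buckets.values())
    return weighted / len(pred)
```

For instance, predicting 0.5 for one true and one false triple gives a square loss of 0.25 but a WDev of 0, because the bucket is perfectly calibrated.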

Note that on the synthetic data Cov is 1 for all methods, and the 
comparison of different methods regarding AUC-PR and WDev is 
very similar to that regarding SqV, so we skip the plots. 

5.1.2 Methods being compared

We compared three main methods. The first, which we call
SingleLayer, implements the state-of-the-art methods for knowledge
fusion [11] (overviewed earlier in the paper). In particular, each
source or “provenance” is a 4-tuple (extractor, website,
predicate, pattern). We consider a provenance in fusion
only if its accuracy does not remain at the default value over the
iterations because of low coverage. We set n = 100 and iterate 5 times.
These settings have been shown in [?] to perform best.


The second, which we call MultiLayer, implements the multi-layer
model described in Section 3. To have reasonable execution
time, we used the finest granularity specified in Section 4 for
extractors and sources: each extractor is an (extractor, pattern,
predicate, website) vector, and each source is a (website,
predicate, webpage) vector. When we decide extraction
correctness, we consider the confidence provided by extractors,
normalized to [0,1], as in Section 3.5. If an extractor does not
provide confidence, we assume the confidence is 1. When we decide
triple truthfulness, by default we use the improved estimate
$p(C_{wdv} = 1 \mid X)$ described in Section 3.3.3, instead of simply
using a hard estimate of $C_{wdv}$. We start updating the prior
probabilities $p(C_{wdv} = 1)$, as described in Section 3.3.4, from the
third iteration, since the probabilities we compute become stable after
the second iteration. For the noise models, we set n = 10 and
$\gamma = 0.25$, but we found other settings lead to quite similar
results. We vary the settings and show the effect in Section 5.3.3.

The third method, which we call MultiLayerSM, implements
the SplitAndMerge algorithm in addition to the multi-layer model,
as described in Section 4. We set the min and max sizes to m = 5
and M = 10K by default, and varied them in Section 5.3.4.

For each method, there are two variants. The first variant
determines which version of the $p(X_{ewdv} \mid C_{wdv})$ model we use. We
tried both Accu and PopAccu. We found that the performance of
the two variants on the single-layer model was very similar, with
PopAccu slightly better. However, rather surprisingly, we found
that the PopAccu version of the multi-layer model was worse than
the Accu version. This is because we have not yet found a way to
combine the PopAccu model with the improved estimation
procedure described in Section 3.3.3. Consequently, we only report
results for the Accu version in what follows.

The second variant is how we initialize source quality. We either
assign a default quality ($A_w = 0.8$, $R_e = 0.8$, $Q_e = 0.2$) or
initialize the quality according to a gold standard, as explained in
Section 5.3. In this latter case, we append + to the method name to
distinguish it from the default initialization (e.g., SingleLayer+).


5.2 Experiments on synthetic data 


5.2.1 Data set 

We randomly generated data sets containing 10 sources and 5
extractors. Each source provides 100 triples with an accuracy of
$A = 0.7$. Each extractor extracts triples from a source with
probability $S = 0.5$; for each source, it extracts a provided triple with
probability $R = 0.5$; accuracy among extracted subjects (same for
predicates, objects) is $P = 0.8$ (in other words, the precision of
the extractor is $P_e = P^3$). In each experiment we varied one
parameter from 0.1 to 0.9 and fixed the others; we repeated each
experiment 10 times and report the average. Note that our
default setting represents a challenging case, where the sources and
extractors are of relatively low quality.

5.2.2 Results 

Figure 3 plots SqV, SqC, and SqA as we increase the number
of extractors. We assume SingleLayer considers all extracted
triples when computing source accuracy. We observe that the
multi-layer model always performs better than the single-layer model.
As the number of extractors increases, SqV goes down quickly for
the multi-layer model, and SqC also decreases, albeit more slowly.
Although the extra extractors can introduce many more noisy
extractions, SqA stays stable for MultiLayer, whereas it increases
quite a lot for SingleLayer.















Figure 3: Error in estimating $V_d$, $C_{wdv}$ and $A_w$ as we vary the number of extractors in the synthetic data. The multi-layer model has significantly lower
square loss than the single-layer model. The single-layer model cannot estimate $C_{wdv}$, so only one line is shown for SqC.

Figure 4: Error in estimating $V_d$, $C_{wdv}$ and $A_w$ as we vary extractor quality (P and R) and source quality (A) in the synthetic data.


Next we vary source and extractor quality. MultiLayer continues
to perform better than SingleLayer everywhere, and Figure 4
plots the results only for MultiLayer as we vary R, P and A (the plot
for varying S is similar to that for varying R). In general the higher
quality, the lower the loss. There are a few small deviations from 
this trend. When the extractor recall (R) increases, SqA does not 
decrease, as the extractors also introduce more noise. When the ex¬ 
tractor precision (P) increases, we give them higher trust, resulting 
in a slightly higher (but still low) probability for false triples; since 
there are many more false triples than true ones, SqV slightly in¬ 
creases. Similarly, when A increases, there is a very slight increase 
in SqA, because we trust the false triples a bit more. However, over¬ 
all, we believe the experiments on the synthetic data demonstrate 
that our algorithm is working as expected, and can successfully ap¬ 
proximate the true parameter values in these controlled settings. 

5.3 Experiments on KV data 

5.3.1 Data set 

We experimented with knowledge triples collected by Knowledge
Vault [?] on 7/24/2014; for simplicity we call this data set
KV. There are 2.8B triples extracted from 2B+ webpages by 16
extractors, involving 40M extraction patterns. Compared with an old
version of the data collected on 10/2/2013 [?], the current
collection is 75% larger and involves 25% more extractors, 8% more
extraction patterns, and twice as many webpages.

Figure 5 shows the distribution of the number of distinct
extracted triples per URL and per extraction pattern. On the one hand,
we observe some huge sources and extractors: 26 URLs each
contribute over 50K triples (many due to extraction mistakes), 15
websites each contribute over 100M triples, and 43 extraction patterns
each extract over 1M triples. On the other hand, we observe long
tails: 74% of URLs each contribute fewer than 5 triples, and 48% of
extraction patterns each extract fewer than 5 triples. Our
SplitAndMerge strategy is motivated by exactly such observations.

Table 5: Comparison of various methods on KV; the best performance
in each group is marked with *. For SqV and WDev, lower is better; for
AUC-PR and Cov, higher is better.

Method          SqV      WDev      AUC-PR   Cov
SingleLayer     0.131    0.061     0.454*   0.952*
MultiLayer      0.105    0.042     0.439    0.849
MultiLayerSM    0.090*   0.021*    0.449    0.939
SingleLayer+    0.063    0.0043    0.630    0.953
MultiLayer+     0.054*   0.0040    0.693*   0.864
MultiLayerSM+   0.059    0.0039*   0.631    0.955*

To determine whether these triples are true or not (gold standard
labels), we use two methods. The first method is called the
Local-Closed World Assumption (LCWA) [?, 11, ?] and works
as follows. A triple (s, p, o) is considered as true if it appears in
the Freebase KB. If the triple is missing from the KB but (s, p)
appears for any other value o', we assume the KB is locally complete
(for (s, p)), and we label the (s, p, o) triple as false. We label
the rest of the triples (where (s, p) is missing) as unknown and
remove them from the evaluation set. In this way we can decide
truthfulness of 0.74B triples (26% in KV), of which 20% are true
(in Freebase).
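
The LCWA labeling rule can be sketched as below. The in-memory dictionary standing in for the KB and the function name `lcwa_label` are assumptions made for illustration only.

```python
def lcwa_label(triple, kb):
    """Label a triple under the Local-Closed World Assumption.

    kb: dict mapping (subject, predicate) to the set of known objects.
    Returns True (true), False (false), or None (unknown).
    """
    s, p, o = triple
    known_objects = kb.get((s, p))
    if not known_objects:
        return None          # (s, p) missing from the KB: unknown
    return o in known_objects


# Toy KB with a single fact.
kb = {("Obama", "nationality"): {"USA"}}
labels = [
    lcwa_label(("Obama", "nationality", "USA"), kb),    # in the KB
    lcwa_label(("Obama", "nationality", "Kenya"), kb),  # other value known
    lcwa_label(("Obama", "spouse", "Michelle"), kb),    # (s, p) missing
]
```

Triples labeled unknown are dropped from the evaluation set rather than counted as false.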

Second, we apply type checking to find incorrect extractions. In 
particular, we consider a triple (s,p,o) as false if 1) s = o; 
2) the type of s or o is incompatible with what is required by the 
predicate; or 3) o is outside the expected range (e.g., the weight of 
an athlete is over 1000 pounds). We discovered 0.56B triples (20% 
in KV) that violate such rules and consider them both as false 
triples and as extraction mistakes. 

Our gold standard includes triples from both labeling methods. It
contains in total 1.3B triples, among which 11.5% are true.


5.3.2 Single-layer vs multi-layer
Table 5 compares the performance of the three methods. Figure 8
plots the calibration curve and Figure 9 plots the PR-curve. We see
that all methods are fairly well calibrated, but the multi-layer model
has a better PR curve. In particular, SingleLayer often predicts a
low probability for true triples and hence has a lot of false negatives.
Figure 5: Distribution of #Triples per URL or extraction
pattern motivates SplitAndMerge.

Figure 6: Distribution of predicted extraction correctness
shows effectiveness of MultiLayer+.

Figure 7: Distribution of KBT for websites with
at least 5 extracted triples.

Figure 8: Calibration curves for various methods
on KV data.

Figure 9: PR-curves for various methods on KV
data. MultiLayer+ has the best curve.

Figure 10: KBT and PageRank are orthogonal
signals.


We see that MultiLayerSM has better results than MultiLayer,
but surprisingly, MultiLayerSM+ has lower performance
than MultiLayer+. That is, there is an interaction between the
granularity of the sources and the way we initialize their accuracy.

The reason for this is as follows. When we initialize source
and extractor quality using default values, we are using unsupervised
learning (no labeled data). In this regime, MultiLayerSM
merges small sources so it can better predict their quality, which is
why it is better than standard MultiLayer. Now consider when
we initialize source and extractor quality using the gold standard; in
this case, we are essentially using semi-supervised learning. Smart
initialization helps the most when we use a fine granularity for
sources and extractors, since in such cases we often have much
less data for a source or an extractor.

Finally, to examine the quality of our prediction on extraction
correctness (recall that we lack a full gold standard), we plotted the
distribution of the predictions on triples with type errors (ideally we
wish to predict a probability of 0 for them) and on correct triples
(presumably a lot of them, though not all, would be correctly
extracted, and we should predict a high probability). Figure 6 shows the
results by MultiLayer+. We observe that for the triples with type
errors, MultiLayer+ predicts a probability below 0.1 for 80% of
them and a probability above 0.7 for only 8%; in contrast, for the
correct triples in Freebase, MultiLayer+ predicts a probability
below 0.1 for 26% of them and a probability above 0.7 for 54%,
showing the effectiveness of our model.

5.3.3 Effects of varying the inference algorithm 
Table 6 shows the effect of changing different pieces of the
multi-layer inference algorithm, as follows.

Table 6: Contribution of different components, where significantly
worse values (compared to the baseline) are marked with *.

Method                        SqV      WDev      AUC-PR   Cov
MultiLayer+                   0.054    0.0040    0.693    0.864
p(V_d | C_d)                  0.061*   0.0038    0.570*   0.880
Not updating α                0.055    0.0057*   0.699    0.864
p(C_dwv | I(X_ewdv > φ))      0.053    0.0040    0.696    0.864

Row $p(V_d \mid C_d)$ shows the change we incur by treating $C_d$ as
observed data when inferring $V_d$ (as described in Section 3.3.2), as
opposed to using the confidence-weighted version in Section 3.3.3.
We see a significant drop in the AUC-PR metric and an increase in
SqV by ignoring uncertainty in $C_d$; indeed, we predict a probability
below 0.05 for the truthfulness of 93% of the triples.

Row “Not updating α” shows the change we incur if we keep
$p(C_{wdv} = 1)$ fixed at $\alpha$, as opposed to using the updating scheme
described in Section 3.3.4. We see that most metrics are the same,
but WDev has gotten significantly worse, showing that the
probabilities are less well calibrated. It turns out that not updating the
prior often results in over-confidence when computing $p(V_d \mid X)$, as
shown in Example 3.3.

Row $p(C_{dwv} \mid I(X_{ewdv} > \phi))$ shows the change we incur by
thresholding the confidence-weighted extractions at a threshold of
$\phi = 0$, as opposed to using the confidence-weighted extension in
Section 3.5. Rather surprisingly, we see that thresholding seems
to work slightly better; however, this is consistent with previous
observations that some extractors can be bad at predicting
confidence [?].

5.3.4 Computational efficiency 

All the algorithms were implemented in FlumeJava [?], which is
based on Map-Reduce. Absolute running times can vary dramatically
depending on how many machines we use. Therefore, Table 7
shows only the relative efficiency of the algorithms. We report
the time for preparation, including applying splitting and merging
on web sources and on extractors; and the time for iteration,
including computing extraction correctness, computing triple
truthfulness, computing source accuracy, and computing extractor quality.
For each component in the iterations, we report the average
execution time among the five iterations. By default m = 5, M = 10K.

Table 7: Relative running time, where we consider one iteration of
MultiLayer as taking 1 unit of time. We see that using split and
split-merge is, on average, 3 times faster per iteration.

         Task             Normal   Split    Split&Merge
Prep.    Source           0        0.28     0.5
         Extractor        0        0.50     0.46
         Total            0        0.779    1.034
Iter.    I. ExtCorr       0.097    0.098    0.094
         II. TriplePr     0.098    0.079    0.087
         III. SrcAccu     0.105    0.080    0.074
         IV. ExtQuality   0.700    0.082    0.074
         Total            1        0.337    0.329
Total                     5        2.466    2.679

First, we observe that splitting large sources and extractors can 
significantly reduce execution time. In our data set some extractors 
extract a huge number of triples from some websites. Splitting such 
extractors has a speedup of 8.8 for extractor-quality computation. 
In addition, we observe that splitting large sources also reduces ex¬ 
ecution time by 20% for source-accuracy computation. On average 
each iteration has a speed up of 3. Although there is some overhead 
for splitting, the overall execution time dropped by half. 

Second, we observe that applying merging in addition does not
add much overhead. Although it increases preparation time by 33%, it
slightly reduces the execution time of each iteration (by 2.4%)
because there are fewer sources and extractors. The overall
execution time increases over splitting by only 8.6%. In contrast, a baseline
strategy that starts with the coarsest granularity and then splits big
sources and extractors slows down preparation by 3.8 times.
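
The overall figures can be recomputed from the preparation and per-iteration totals of Table 7, assuming the overall time is the preparation time plus five iterations. The sketch below uses the rounded table entries, so its results agree with the reported numbers only up to rounding.

```python
ITERATIONS = 5

# Totals copied from Table 7 (relative time units).
prep = {"Normal": 0.0, "Split": 0.779, "Split&Merge": 1.034}
per_iter = {"Normal": 1.0, "Split": 0.337, "Split&Merge": 0.329}

# Overall time = preparation + five iterations.
total = {name: prep[name] + ITERATIONS * per_iter[name] for name in prep}

# Per-iteration speedup of splitting over the normal setting.
iteration_speedup = per_iter["Normal"] / per_iter["Split"]

# Extra overall cost of adding merging on top of splitting.
merge_overhead = total["Split&Merge"] / total["Split"] - 1
```

Under these inputs splitting roughly halves the overall time relative to the normal setting, and merging adds less than a tenth on top of splitting.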

Finally, we examined the effect of the m and M parameters. We
observe that varying M from 1K to 50K affects prediction quality
very little; however, setting M = 1K (more splitting) slows down
preparation by 19% and setting M = 50K (less splitting) slows
down the inference by 21%, so both have longer execution time.
On the other hand, increasing m to be above 5 does not change the
performance much, while setting m = 2 (less merging) increases
WDev by 29% and slows down inference by 14%.

5.4 Experiments related to KBT 

We now evaluate how well we estimate the trustworthiness of
webpages. Our data set contains 2B+ webpages from 26M
websites. Among them, our multi-layer model believes that we have
correctly extracted at least 5 triples from about 119M webpages
and 5.6M websites. Figure 7 shows the distribution of KBT scores:
we observed that the peak is at 0.8 and 52% of the websites have a
KBT over 0.8.

5.4.1 KBT vs PageRank

Since we do not have ground truth on webpage quality, we
compare our method to PageRank. We compute PageRank for all
webpages on the web, and normalize the scores to [0,1]. Figure 10 plots
KBT and PageRank for 2000 randomly selected websites. As
expected, the two signals are almost orthogonal. We next investigate
the two cases where KBT differs significantly from PageRank.

Low PageRank but high KBT (bottom-right corner): To under¬ 
stand which sources may obtain high KBT, we randomly sampled 
100 websites whose KBT is above 0.9. The number of extracted 
triples from each website varies from hundreds to millions. For 
each website we considered the top 3 predicates and randomly se¬ 
lected from these predicates 10 triples where the probability of the 


extraction being correct is above 0.8. We manually evaluated each 
website according to the following 4 criteria. 

• Triple correctness: whether at least 9 triples are correct. 

• Extraction correctness: whether at least 9 triples are cor¬ 
rectly extracted (and hence we can evaluate the website ac¬ 
cording to what it really states). 

• Topic relevance: we decide the major topics for the website
according to the website name and the introduction in the
“About us” page; we then decide whether at least 9 triples are
relevant to these topics (e.g., if the website is about business
directories in South America but the extractions are about
cities and countries in SA, we consider them as not topic
relevant).

• Non-trivialness: we decide whether the sampled triples state
non-trivial facts (e.g., if most sampled triples from a Hindi
movie website state that the language of the movie is Hindi,
we consider it as trivial).

We consider a website as truly trustworthy if it satisfies all of
the four criteria. Among the 100 websites, 85 are considered
trustworthy; 2 are not topic relevant, 12 do not have enough non-trivial
triples, and 2 have more than 1 extraction error (one website has
two issues). However, only 20 out of the 85 trustworthy sites have
a PageRank over 0.5. This shows that KBT can identify sources
with trustworthy data, even though they are tail sources with low
PageRanks.

High PageRank but low KBT (top-left corner): We consider the
15 gossip websites listed in [?]. Among them, 14 have a PageRank
among the top 15% of the websites, since such websites are often
popular. However, for all of them the KBT scores are in the bottom 50%;
in other words, they are considered less trustworthy than half of the
websites. Another kind of website that often gets a low KBT is the
forum website. For instance, we discovered that answers.yahoo.com
says that “Catherine Zeta-Jones is from New Zealand”^, although
she was born in Wales according to Wikipedia^.

5.4.2 Discussion 

Although we have seen that KBT seems to provide a useful sig¬ 
nal about trustworthiness, which is orthogonal to more traditional 
signals such as PageRank, our experiments also show places for 
further improvement as future work. 

1. To avoid evaluating KBT on topic irrelevant triples, we need 
to identify the main topics of a website, and filter triples 
whose entity or predicate is not relevant to these topics. 

2. To avoid evaluating KBT on trivial extracted triples, we need 
to decide whether the information in a triple is trivial. One 
possibility is to consider a predicate with a very low variety 
of objects as less informative. Another possibility is to asso¬ 
ciate triples with an IDF (inverse document frequency), such 
that low-IDF triples get less weight in KBT computation. 

3. Our extractors (and most state-of-the-art extractors) still have
limited extraction capabilities and this limits our ability to
estimate KBT for all websites. We wish to increase our KBT
coverage by extending our method to handle open-IE style
information extraction techniques, which do not conform to
a schema [?]. However, although these methods can extract
more triples, they may introduce more noise.

4. Some websites scrape data from other websites. Identifying
such websites requires techniques such as copy detection.
Scaling up copy detection techniques, such as [?, ?],
has been attempted in [?], but more work is required
before these methods can be applied to analyzing extracted data
from billions of web sources.

^ https://answers.yahoo.com/question/index?qid=20070206090808AAC54nH.
^ http://en.wikipedia.org/wiki/Catherine_Zeta-Jones.

6. RELATED WORK

There has been a lot of work studying how to assess quality of
web sources. PageRank [?] and Authority-hub analysis [?]
consider signals from link analysis (surveyed in [?]). EigenTrust [?]
and TrustMe [28] consider signals from source behavior in a P2P
network. Web topology [?], TrustRank [?], and AntiTrust [?]
detect web spam. The knowledge-based trustworthiness we
propose in this paper is different from all of them in that it considers
an important endogenous signal: the correctness of the factual
information provided by a web source.

Our work is relevant to the body of work in data fusion
(surveyed in [?]), where the goal is to resolve conflicts from data
provided by multiple sources and find the truths that are consistent
with the real world. Most of the recent work in this area considers
trustworthiness of sources, measured by link-based measures [24, ?],
IR-based measures [?], accuracy-based measures [8, 13, ?],
and graphical-model analysis [?, 31, ?]. However,
these papers do not model the concept of an extractor, and
hence they cannot distinguish an unreliable source from an
unreliable extractor.

Graphical models have been proposed to solve the data fusion 
problem [?]. These models are broadly similar to 
our single-layer model presented earlier in the paper: [?] considers 
a single truth, [?] considers numerical values, [?] allows multiple 
truths, and [31] considers correlations between the sources. However, 
these prior works do not model the concept of an extractor, 
and hence they cannot capture the fact that sources and extractors 
introduce qualitatively different kinds of noise. In addition, the data 
sets used in their experiments are typically 5-6 orders of magnitude 
smaller in scale than ours, and their inference algorithms are inherently 
slower than our algorithm. The multi-layer model and the 
scale of our experimental data also distinguish our work from other 
data fusion techniques. 

Finally, the most relevant work is our previous work on knowledge 
fusion [?]. We have given a detailed comparison earlier in the paper, 
as well as an empirical comparison in our experiments, showing that 
MULTILAYER improves over SingleLayer for knowledge fusion and 
gives the opportunity of evaluating KBT for web source quality. 

7. CONCLUSIONS 

This paper proposes a new metric for evaluating web-source quality: 
knowledge-based trust. We proposed a sophisticated probabilistic 
model that jointly estimates the correctness of extractions and 
source data, and the trustworthiness of sources. In addition, we presented 
an algorithm that dynamically decides the level of granularity 
for each source. Experimental results have shown both promise 
in evaluating web source quality and improvement over existing 
techniques for knowledge fusion. 